1 Transparent checkpointing and rollback recovery mechanism for Windows NT
applications

Youhui Zhang, Dongsheng Wang, Weimin Zheng

April 2001 ACM SIGOPS Operating Systems Review, volume 35 issue 2 
Publisher: ACM Press 

Full text available: pdf (500.64 KB) Additional Information: full citation, abstract, index terms

Clusters of industry-standard computers running Windows NT are emerging as a 
competitive alternative for large-scale parallel computing. However, clusters have 
increased susceptibility to failure especially when they contain many nodes. Therefore it is 
necessary to implement high availability on Windows NT. This paper introduces the 
Checkpoint and Rollback Recovery (CRR) mechanism on Windows NT and presents 
WinNTCkpt, a Checkpointing and recovery tool implemented by us. WinNT ... 

Keywords: API interception, availability, checkpoint and rollback recovery, fault 
tolerance, thread injection 



2 A model of the performance of a rollback algorithm
Fred J. Maryanski, Kirk A. Norsworthy 

January 1979 Proceedings of the 1979 annual conference 
Publisher: ACM Press 

Full text available: pdf (558.77 KB) Additional Information: full citation, abstract, references, index terms

The performance characteristics of a rollback algorithm are analyzed in a simulation 
experiment. An overview of the operation of the rollback algorithm is presented, followed 
by a discussion of the simulation model and its parameters. The model, as implemented, 
consists of data definition, data manipulation command processing, and rollback facilities. 
The model is parameterized in terms of number of application tasks and amount of data 
sharing and driven by randomized streams of data manipu ... 

3 Checkpointing-based rollback recovery for parallel applications on the InteGrade grid
middleware

Raphael Y. de Camargo, Andrei Goldchleger, Fabio Kon, Alfredo Goldman 
October 2004 Proceedings of the 2nd workshop on Middleware for grid computing 
Publisher: ACM Press 

Full text available: pdf (137.21 KB) Additional Information: full citation, abstract, references, index terms

InteGrade is a grid middleware infrastructure that enables the use of idle computing 
power from user workstations. One of its goals is to support the execution of long-running 
parallel applications that present a considerable amount of communication among 
application nodes. However, in an environment composed of shared user workstations 
spread across many different LANs, machines may fail, become unaccessible, or may 
switch from idle to busy very rapidly, compromising the execution of the par ... 



Keywords: BSP, checkpointing, fault-tolerance, grid computing 



4 A survey of rollback-recovery protocols in message-passing systems
E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson 
September 2002 ACM Computing Surveys (CSUR), volume 34 issue 3 
Publisher: ACM Press 

Full text available: pdf (549.68 KB) Additional Information: full citation, abstract, references, citings, index terms, review

This survey covers rollback-recovery techniques that do not require special language 
constructs. In the first part of the survey we classify rollback-recovery protocols into 
checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing 
for system state restoration. Checkpointing can be coordinated, uncoordinated, or 
communication-induced. Log-based protocols combine checkpointing with logging of 
nondeterministic events, encoded in tuples call ... 

Keywords: message logging, rollback-recovery 



5 ARIES: a transaction recovery method supporting fine-granularity locking and partial
rollbacks using write-ahead logging

C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, Peter Schwarz 

March 1992 ACM Transactions on Database Systems (TODS), volume 17 issue 1

Publisher: ACM Press 

Full text available: pdf (5.23 MB) Additional Information: full citation, abstract, references, citings, index terms, review

DB2™, IMS, and Tandem™ systems. ARIES is applicable not only to database
management systems but also to persistent object-oriented languages, recoverable file
systems and transaction-based operating systems. ARIES has been implemented, to
varying degrees, in IBM's OS/2™ Extended Edition Database Manager, DB2, Workstation
Data Save Facility/VM, Starburst and Quicksilver, and in the University of Wisconsin's 
EXODUS and Gamma d ... 

Keywords: buffer management, latching, locking, space management, write-ahead 
logging 



6 Concurrency control in advanced database applications
Naser S. Barghouti, Gail E. Kaiser 

September 1991 ACM Computing Surveys (CSUR), Volume 23 issue 3 
Publisher: ACM Press 

Full text available: pdf (4.69 MB) Additional Information: full citation, references, citings, index terms




Keywords: advanced database applications, concurrency control, cooperative 
transactions, design environments, extended transaction models, long transactions, 
object-oriented databases, relaxing serializability 



7 On the relevance of communication costs of rollback-recovery protocols
E. N. Elnozahy 

August 1995 Proceedings of the fourteenth annual ACM symposium on Principles of 
distributed computing 

Publisher: ACM Press 

Full text available: pdf (670.77 KB) Additional Information: full citation, references, citings, index terms




8 An analysis of rollback-based simulation 
Boris Lubachevsky, Adam Schwartz, Alan Weiss 



April 1991 ACM Transactions on Modeling and Computer Simulation (TOMACS), volume
1 issue 2
Publisher: ACM Press 

Full text available: pdf (2.47 MB) Additional Information: full citation, references, citings, index terms, review



9 Session 4: AtomCaml: first-class atomicity via rollback 
Michael F. Ringenburg, Dan Grossman 

September 2005 Proceedings of the tenth ACM SIGPLAN international conference on 
Functional programming ICFP '05 

Publisher: ACM Press 

Full text available: pdf (244.09 KB) Additional Information: full citation, abstract, references, index terms

We have designed, implemented, and evaluated AtomCaml, an extension to Objective 
Caml that provides a synchronization primitive for atomic (transactional) execution of 
code. A first-class primitive function of type (unit->'a)->'a evaluates its argument (which
may call other functions, even external C functions) as though no other thread has 
interleaved execution. Our design ensures fair scheduling and obstruction-freedom. Our 
implementation extends the Objective Caml bytecode compiler and ... 

Keywords: atomicity, concurrent programming, objective caml, transactions 




10 Multiplexed state saving for bounded rollback
Fabian Gomes, Brian Unger, John Cleary, Steve Franks 

December 1997 Proceedings of the 29th conference on Winter simulation 
Publisher: ACM Press 

Full text available: pdf (888.13 KB) Additional Information: full citation, references, citings, index terms




11 Is SC + ILP = RC?

Chris Gniady, Babak Falsafi, T. N. Vijaykumar 

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th
annual international symposium on Computer architecture ISCA '99, volume
27 issue 2

Publisher: IEEE Computer Society, ACM Press 

Full text available: pdf (94.82 KB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Sequential consistency (SC) is the simplest programming interface for shared-memory 
systems but imposes program order among all memory operations, possibly precluding 
high performance implementations. Release consistency (RC), however, enables the 
highest performance implementations but puts the burden on the programmer to specify 
which memory operations need to be atomic and in program order. This paper shows, for 
the first time, that SC implementations can perform as well as RC implementations ... 




12 Global transaction support for workflow management systems: from formal
specification to practical implementation
Paul Grefen, Jochem Vonk, Peter Apers 

December 2001 The VLDB Journal — The International Journal on Very Large Data 

Bases, Volume 10 Issue 4 
Publisher: Springer-Verlag New York, Inc. 

Full text available: pdf (260.06 KB) Additional Information: full citation, abstract, citings, index terms

In this paper, we present an approach to global transaction management in workflow 
environments. The transaction mechanism is based on the well-known notion of 
compensation, but extended to deal with both arbitrary process structures to allow cycles 
in processes and safepoints to allow partial compensation of processes. We present a 
formal specification of the transaction model and transaction management algorithms in 
set and graph theory, providing clear, unambiguous transaction semantics. The ... 



Keywords: Compensation, Long-running transaction, Transaction Management, Workflow 
management 



13 Optimistic simulation of parallel message-passing applications
Thomas Phan, Rajive Bagrodia 

May 2001 Proceedings of the fifteenth workshop on Parallel and distributed 
simulation 

Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, index terms

Optimistic techniques can improve the performance of discrete-event simulations, but one 
area where optimistic simulators have been unable to show performance improvement is 
in the simulation of parallel programs. Unfortunately parallel program simulation using 
direct execution is difficult: the use of direct execution implies that the memory and 
computation requirements of the simulator are at least as large as that of the target 
application, which restricts the target systems and applica ... 



14 ReVive: cost-effective architectural support for rollback recovery in shared-memory
multiprocessors

Milos Prvulovic, Zheng Zhang, Josep Torrellas

May 2002 ACM SIGARCH Computer Architecture News, Proceedings of the 29th
annual international symposium on Computer architecture ISCA '02,
volume 30 issue 2

Publisher: IEEE Computer Society, ACM Press 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings, index terms

This paper presents ReVive, a novel general-purpose rollback recovery mechanism for 
shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of 
availability, performance, and hardware cost. ReVive performs checkpointing, logging, 
and distributed parity protection, all memory-based. It enables recovery from a wide class 
of errors, including the permanent loss of an entire node. To maintain high performance, 
ReVive includes specialized hardware that performs frequent o ... 

Keywords: fault tolerance, shared-memory multiprocessors, rollback recovery, recovery, 
BER, logging, parity, checkpointing, availability 



15 A study of time warp rollback mechanisms 
Yi-Bing Lin, Edward D. Lazowska 

January 1991 ACM Transactions on Modeling and Computer Simulation (TOMACS),
volume 1 issue 1
Publisher: ACM Press 

Full text available: pdf (1.31 MB) Additional Information: full citation, abstract, references, citings, index terms, review

The Time Warp "optimistic" approach is one of the most important parallel simulation 
protocols. Time Warp synchronizes processes via rollback. The original rollback 
mechanism called lazy cancellation has aroused great interest. This paper studies these 
rollback mechanisms. The general tradeoffs between aggressive and lazy cancellation are 
discussed, and a conservative-optimal simulation is defined for comparative purposes.
Within the framewor ... 

Keywords: Time Warp approach, aggressive cancellation, discrete-event simulation, lazy 
cancellation, parallel simulation 



16 The WarpIV Simulation Kernel
Jeffrey S. Steinman 

June 2005 Proceedings of the 19th Workshop on Principles of Advanced and
Distributed Simulation PADS '05

Publisher: IEEE Computer Society 

Full text available: pdf (1.28 MB) Additional Information: full citation, abstract

This paper provides an overview of the WarpIV Simulation Kernel that was designed to be 
an initial implementation of the Standard Simulation Architecture (SSA). WarpIV is the 
next generation replacement for the Synchronous Parallel Environment for Emulation and 
Discrete Event Simulation (SPEEDES) framework that has supported a number of DoD 
simulation programs including MDWAR, EADTB, JSIMS, and others. This paper first 
provides a look back at the historical evolution of SPEEDES, the evolution of ... 

17 Logged virtual memory
D. R. Cheriton, K. J. Duda 

December 1995 ACM SIGOPS Operating Systems Review, Proceedings of the fifteenth
ACM symposium on Operating systems principles SOSP '95, volume 29
issue 5

Publisher: ACM Press 

Full text available: pdf (1.52 MB) Additional Information: full citation, references, index terms




18 Cost of state saving & rollback 

John Cleary, Fabian Gomes, Brian Unger, Zhonge Xiao, Raimar Thudt

July 1994 ACM SIGSIM Simulation Digest, Proceedings of the eighth workshop on
Parallel and distributed simulation PADS '94, volume 24 issue 1
Publisher: ACM Press 

Full text available: pdf (722.11 KB) Additional Information: full citation, abstract, references, citings, index terms

Approaches to state saving and rollback for a shared memory, optimistically synchronized, 
simulation executive are presented. An analysis of copy state saving and incremental 
state saving is made and these two schemes are compared. Two benchmark programs are 
then described, one a simple, all overhead, model and one a performance model of a 
regional Canadian public telephone network. The latter is a large SS7 common channel 
signalling model that represents a very challenging, practical, test ... 

The use of formal methods in hardware design improves the quality of designs in many
ways: it promotes better understanding of the design; it permits systematic design 
refinement through the discovery of invariants; and it allows design verification (informal 
or formal). In this paper we illustrate the use of formal methods in the design of a custom 
hardware system called the "Rollback Chip" (RBC), conducted using a simple hardware 
design description language called "HOP" ...

Availability requirements for database systems are more stringent than ever before with
the widespread use of databases as the foundation for ebusiness. This paper highlights 
Fast-Start™ Fault Recovery, an important availability feature in Oracle, designed to 
expedite recovery from unplanned outages. Fast-Start allows the administrator to 
configure a running system to impose predictable bounds on the time required for crash 
recovery. For instance, fast-start allows fine-gr ... 

In optimistic parallel simulations, state-saving techniques have traditionally been used to
realize rollback. In this article, we propose reverse computation as an alternative 
approach, and compare its execution performance against that of state-saving. Using 
compiler techniques, we describe an approach to automatically generate reversible 
computations, and to optimize them to reap the performance benefits of reverse 
computation transparently. For certain fine-grain models, ... 

We describe an efficient server-based algorithm for garbage collecting persistent object
stores in a client-server environment. The algorithm is incremental and runs concurrently
with client transactions. Unlike previous algorithms, it does not hold any transactional
locks on data and does not require callbacks to clients. It is fault-tolerant, but performs
very little logging. The algorithm has been designed to be integrated into existing
systems, and therefore it works with standard i ...

Database replication is traditionally seen as a way to increase the availability and
performance of distributed databases. Although a large number of protocols providing 
data consistency and fault-tolerance have been proposed, few of these ideas have ever 
been used in commercial products due to their complexity and performance implications. 
Instead, current products allow inconsistencies and often resort to centralized approaches 
which eliminate some of the advantages of replication. As an ...

For reasons of simplicity and communication efficiency, a number of existing object-oriented
database management systems are based on page server architectures; data
pages are their minimum unit of transfer and client caching. Despite their efficiency, page
servers are often criticized as being too restrictive when it comes to concurrency, as
existing systems use pages as the minimum locking unit as well. In this paper we show 
how to support object-level locking in a page-server context. Sev ... 

Cactis is an object-oriented, multiuser DBMS developed at the University of Colorado. The
system supports functionally-defined data and uses techniques based on attributed 
graphs to optimize the maintenance of functionally-defined data. The implementation is 
self-adaptive in that the physical organization and the update algorithms dynamically 
change in order to reduce disk access. The system is also concurrent. At any given time 
there are some number of computations that must be performed t ... 

Speculator provides Linux kernel support for speculative execution. It allows multiple
processes to share speculative state by tracking causal dependencies propagated through 
inter-process communication. It guarantees correct execution by preventing speculative 
processes from externalizing output, e.g., sending a network message or writing to the 
screen, until the speculations on which that output depends have proven to be correct. 
Speculator improves the performance of distributed file systems ... 

Client-server object-oriented database management systems differ significantly from
traditional centralized systems in terms of their architecture and the applications they 
target. In this paper, we present the client-server architecture of the EOS storage 
manager and we describe the concurrency control and recovery mechanisms it employs. 
EOS offers a semi-optimistic locking scheme based on the multi-granularity two-version 
two-phase locking protocol. Under this scheme, multiple concurrent reade ... 

Keywords: Checkpoint, Client-server architecture, Object management, Concurrency 
control, Locking, Logging, Recovery, Transaction management 





11 An analysis of rollback-based simulation 



Boris Lubachevsky, Adam Schwartz, Alan Weiss 



April 1991 ACM Transactions on Modeling and Computer Simulation (TOMACS), volume
1 issue 2
Publisher: ACM Press 

Full text available: pdf (2.47 MB) Additional Information: full citation, references, citings, index terms, review



12 Recovery guarantees for Internet applications

Roger Barga, David Lomet, German Shegalov, Gerhard Weikum 
August 2004 ACM Transactions on Internet Technology (TOIT), volume 4 issue 3 
Publisher: ACM Press 

Full text available: pdf (997.52 KB) Additional Information: full citation, abstract, references, index terms

Internet-based e-services require application developers to deal explicitly with failures of
the underlying software components, for example web servers, servlets, browser 
sessions, and so forth. This complicates application programming, and may expose 
failures to end users. This paper presents a framework for an application-independent 
infrastructure that provides recovery guarantees and masks almost all system failures, 
thus relieving the application programmer from having to deal with these f ... 

In this paper we examine the recovery time in a database system using a Write-Ahead
Log protocol, such as ARIES [9], under the assumption that the buffer replacement policy 
is strict LRU. In particular, analytical equations for log read time, data I/O, log application, 
and undo processing time are presented. Our initial model assumes a read/write ratio of 
one, and a uniform access pattern. This is later generalized to include different read/write 
ratios, as well as a "hot set" m ... 

Chopping transactions into pieces is good for performance but may lead to nonserializable
executions. Many researchers have reacted to this fact by either inventing new
concurrency-control mechanisms, weakening serializability, or both. We adopt a different
approach. We assume a user who has access only to user-level tools such as (1)
choosing isolation degrees 1-4, (2) the ability to execute a portion of a transaction
using multiversion read consistency, and (3) the a ... 

Business processes executing in peer-to-peer environments usually invoke Web services
on different, independent peers. Although peer-to-peer environments inherently lack 
global control, some business processes nevertheless require global transactional 
guarantees, i.e., atomicity and isolation applied at the level of processes. This paper 
introduces a new decentralized serialization graph testing protocol to ensure concurrency 
control and recovery in peer-to-peer environments. The uniqueness oft ... 

Optimistic Recovery is a new technique supporting application-independent transparent
recovery from processor failures in distributed systems. In optimistic recovery 
communication, computation and checkpointing proceed asynchronously. Synchronization 
is replaced by causal dependency tracking, which enables a posteriori reconstruction of a 
consistent distributed system state following a failure using process rollback and message 
replay. 

There is good reason to base intrusion detection on data from the host. Unfortunately,
most operating systems do not provide all the data needed in readily available logs. 
Ironically, perhaps, Windows NT and its successor, Windows 2000, provide much of the
necessary data, at least for security events. We have developed a host-based intrusion 
detector for these platforms that meets the generally accepted criteria for a good 
Intrusion Detection System. Its architecture is sufficiently flexible t ... 